Understanding sampling informs experimental design (Week 4 onward). How many samples do we need and are our samples representative?
Recognising sample (not sampling) distributions helps us choose the right statistical model – e.g. t-test to compare two means that are normally distributed.
Most statistical techniques use sample statistics for interpretation, e.g. the t-test can be explained using confidence intervals, and the ANOVA test can be interpreted in part using means and standard errors.
All of these concepts will make more sense as we go through the course, but if you do not try to understand them now, you will struggle.
Samples, populations and statistical inference
Populations and samples
Populations
All the possible units and their associated observations of interest
Scientists are often interested in making inferences about populations, but measuring every unit is impractical
Samples
A collection of observations from any population is a sample, and the number of observations in it is the sample size
We assume samples that we collect can be used to make inferences about the population
Samples need to be representative of the population
Statistics vs parameters
Characteristics of the population are called parameters (e.g. population mean or population regression slope)
Characteristics of the sample are called statistics (e.g. sample mean or sample regression slope) – they are used to estimate the population parameters
Statistics are what we use to help us understand the population
Formal statistical methods can help us make inferences about the population based on the sample – statistical inference
Not all statistical techniques are inferential, but many are
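A quick sketch of this parameter/statistic distinction, using a simulated population (the numbers here are purely illustrative):

```r
set.seed(42)

# Treat these 100,000 values as the entire population
population <- rnorm(100000, mean = 170, sd = 10)
mu <- mean(population) # population parameter (normally unknown to us)

# A sample of 50 units gives a statistic that estimates the parameter
x <- sample(population, size = 50)
xbar <- mean(x) # sample statistic

mu
xbar
```

In practice we only ever see `xbar`; the point of statistical inference is to say something about `mu` from it.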
Sample data
Sample data are usually collected as variables, which are the characteristics we measure or record from each object.
Variables can be:
Categorical Variables
Nominal: categories without a natural order (e.g. colors, names)
Ordinal: categories with a natural order (e.g. ratings, rankings)
Numerical Variables
Continuous: can take any value within a range (e.g. height, weight)
Discrete: can take only specific, separated values (e.g. counts, or presence/absence coded as 0/1)
YOU decide on what a variable represents
A numerical, continuous variable can be treated as a categorical variable if you decide to categorise it.
Examples
height (in cm) – a numerical, continuous variable, can be treated as a categorical variable if you group it into categories (short, medium, tall)
age (in years) – a numerical, discrete variable, can be treated as a continuous variable (if we allow for certain issues)
treatment (A, B, C) – a categorical variable, can be treated as a numerical variable if we assign numbers to the treatments (1, 2, 3) and assume they are ordered e.g. effect of 1 < 2 < 3 – the basis of non-parametric tests
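These re-codings are easy to express in R. A minimal sketch with made-up values, using `cut()` to bin a continuous variable and `as.numeric()` on a factor to impose an ordering:

```r
# Hypothetical heights in cm (continuous)
height <- c(152, 160, 168, 175, 181, 190)

# Treat a continuous variable as categorical by binning it
height_cat <- cut(height,
                  breaks = c(0, 160, 175, Inf),
                  labels = c("short", "medium", "tall"))

# Treat a categorical treatment as ordered numbers (the idea behind rank-based tests)
treatment <- factor(c("A", "B", "C", "A", "B", "C"))
treatment_num <- as.numeric(treatment) # A = 1, B = 2, C = 3

table(height_cat)
treatment_num
```

Whether such a re-coding is sensible is a modelling decision, not a property of the data.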
Distribution of data
Types of probability distributions
Populations can be described by probability distributions, and by now, you should be familiar with these distributions and their properties
Normal Distribution: Bell-shaped curve, symmetric around the mean. Data is continuous
Binomial Distribution: Models success/failure outcomes in a fixed number of trials. Data is discrete
Poisson Distribution: Models count data when events occur at a constant rate. Data is discrete
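As a sketch using base R's random-number generators (the scenarios below are hypothetical), we can draw from each of these distributions:

```r
set.seed(1)

# Normal: continuous, symmetric around the mean
heights <- rnorm(1000, mean = 170, sd = 10)

# Binomial: number of successes in a fixed number of trials
germinated <- rbinom(1000, size = 20, prob = 0.3) # e.g. 20 seeds per tray

# Poisson: counts of events occurring at a constant rate
insects <- rpois(1000, lambda = 4) # e.g. insects per trap per night

summary(heights)
table(germinated)
mean(insects) # close to lambda with this many draws
```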
Knowing the distribution of your data is important for choosing the right statistical model – although it is not always necessary.
Why do we divide by n − 1 (rather than n) when calculating sample statistics such as the standard deviation? Because we lose one “degree of freedom” when we estimate the mean:
If you know the sample mean (\bar{x})
And you know all but one value in your sample
The last value is constrained: it must make the mean equal \bar{x}
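This constraint is easy to verify directly (the values below are arbitrary illustrations):

```r
x <- c(4, 7, 9, 2, 8)
xbar <- mean(x) # 6

# Suppose we know xbar and all but the last value:
known <- x[1:4]

# The last value is forced: it must bring the total to n * xbar
last <- length(x) * xbar - sum(known)
last # recovers x[5] = 8

# This is why the sample variance divides by n - 1, not n
var(x)
sum((x - xbar)^2) / (length(x) - 1) # identical
```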
Sampling distributions and CLT
What is a sampling distribution?
Distribution of a statistic (e.g., mean) calculated from repeated samples
Shows how sample statistics vary from sample to sample
Important for understanding sampling variability and making inferences
Sampling distribution of the mean
Central Limit Theorem
“I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the ‘Law of Frequency of Error’ [the normal distribution]. The law would have been personified by the Greeks and deified, if they had known of it.”
– Sir Francis Galton, 1889, Natural Inheritance
The Central Limit Theorem (CLT) states that for sufficiently large samples:
The sampling distribution of the mean follows a normal distribution
The mean of the sampling distribution equals the population mean
The standard deviation of the sampling distribution (standard error) = \frac{\sigma}{\sqrt{n}}
CLT in action
Code
library(ggplot2) # assumes ggplot2 is installed

# Create a skewed population
set.seed(456)
skewed_pop <- exp(rnorm(10000, mean = 0, sd = 0.5))

# Sample means for different sample sizes (ordered small to large)
sample_sizes <- c(5, 30, 100)
sample_labels <- factor(paste("n =", sample_sizes),
                        levels = paste("n =", sample_sizes)) # preserve order

sample_dist_data <- lapply(sample_sizes, function(n) {
  means <- replicate(1000, mean(sample(skewed_pop, size = n)))
  data.frame(
    means = means,
    size = factor(paste("n =", n), levels = levels(sample_labels))
  )
})
sample_dist_df <- do.call(rbind, sample_dist_data)

# Plot
ggplot() +
  geom_histogram(
    aes(x = means, y = after_stat(density)),
    data = sample_dist_df,
    bins = 30, fill = "lightblue", color = "black", alpha = 0.7
  ) +
  geom_density(aes(x = means), data = sample_dist_df, color = "blue") +
  facet_wrap(~size, scales = "free_x") +
  ggtitle("Sampling distributions for different sample sizes")
With only one sample, we are not really seeing a sampling distribution – we are just replicating the same population distribution. A sampling distribution emerges when we take multiple samples and calculate their means.
Increasing the sample size gives a more precise estimate of the population mean, reflected in the narrower distribution of the sample mean and captured by the standard error.
Effect of variability
Code
library(ggplot2)   # assumes ggplot2, tibble and patchwork are installed
library(tibble)
library(patchwork) # provides wrap_plots()

set.seed(1221)

# Define a function to generate ggplot objects
generate_plot <- function(sd) {
  data <- rnorm(500, 1.99, sd)
  ggplot(data = tibble(x = data), aes(x = x)) +
    geom_histogram(fill = "orangered", alpha = 0.5, bins = 50) +
    ggtitle(paste("SD =", sd)) +
    xlim(-100, 100)
}

# Apply the function to a list of standard deviations
sds <- c(3, 6, 15, 25)
plots <- lapply(sds, generate_plot)

# Wrap the plots
wrap_plots(plots)
Increased variability (i.e. wide range of tree heights) leads to a wider distribution of the sample mean (i.e. less precision), which is also reflected by the standard error.
CLT drives statistical inference
Because the CLT applies so predictably to sample means, we can make reasonably accurate inferences about the population mean even when we do not know the population distribution.
A sampling distribution of the mean will be normally distributed for sufficiently large samples – how large is “sufficient” depends on the population distribution
The mean of the sampling distribution trends towards the population mean with increasing sample size
To determine how well the sample mean estimates the population mean, we use the standard error of the mean – basically a standard deviation of the sampling distribution
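The points above can be checked by simulation. A minimal sketch, again with a simulated population: build an empirical sampling distribution of the mean and compare its standard deviation with the CLT prediction \sigma / \sqrt{n}:

```r
set.seed(99)
population <- rnorm(100000, mean = 50, sd = 12)

n <- 25
# Build the sampling distribution of the mean empirically
sample_means <- replicate(5000, mean(sample(population, size = n)))

# SD of the sampling distribution vs the CLT prediction sigma / sqrt(n)
sd(sample_means)
sd(population) / sqrt(n) # approx. 12 / 5 = 2.4
```

The two values agree closely, which is exactly why the standard error is a useful summary of sampling variability.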
Standard error and confidence intervals
Standard Error of the Mean
Measures the precision of a sample mean
Describes variation in sample means – around the true population mean
Decreases as sample size increases, because we become more “confident” in our estimate
Formula
SE_{\bar{x}} = \frac{s}{\sqrt{n}}
where s is the sample standard deviation
n is the sample size
When to report SD or SE
Standard Deviation (SD)
Describes variability in your data
Stays constant regardless of sample size
Standard Error (SE)
Describes precision of your mean estimate
Decreases with larger sample size (SE = \frac{SD}{\sqrt{n}})
When reporting statistics:
Use mean ± SE to show precision of your estimate
Use mean ± SD to show spread of your raw data
SE can appear deceptively small with large sample sizes – always report sample size!
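The contrast is easy to demonstrate (simulated population; the numbers are illustrative): as n grows, the SD stays roughly constant while the SE shrinks towards zero.

```r
set.seed(7)
population <- rnorm(100000, mean = 20, sd = 5)

for (n in c(10, 100, 1000)) {
  x <- sample(population, size = n)
  cat("n =", n,
      "| SD =", round(sd(x), 2),
      "| SE =", round(sd(x) / sqrt(n), 2), "\n")
}
```

The SD estimates hover around 5 at every sample size; only the SE decreases.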
Confidence intervals
What is a confidence interval?
Range of values likely to contain the true population parameter
Level of confidence (usually 95%) indicates reliability
Wider intervals = less precise estimates
Formula for 95% CI
\bar{x} \pm (t_{n-1} \times SE_{\bar{x}})
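This formula can be applied by hand and checked against `t.test()`, which computes the same interval (the data here are simulated for illustration):

```r
set.seed(11)
x <- rnorm(40, mean = 100, sd = 15)

n <- length(x)
se <- sd(x) / sqrt(n)
tcrit <- qt(0.975, df = n - 1) # t_{n-1} for a 95% CI

ci <- mean(x) + c(-1, 1) * tcrit * se
ci

# t.test() computes the same interval
t.test(x)$conf.int
```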
Visualising confidence intervals
Code
library(dplyr)   # assumes dplyr and ggplot2 are installed
library(ggplot2)

# Generate sample data
set.seed(253)
sample_data <- data.frame(
  group = rep(c("A", "B", "C"), each = 30),
  value = c(
    rnorm(30, 100, 15),
    rnorm(30, 110, 15),
    rnorm(30, 105, 15)
  )
)

# Calculate means and CIs
ci_data <- sample_data %>%
  group_by(group) %>%
  summarise(
    mean = mean(value),
    se = sd(value) / sqrt(n()),
    ci_lower = mean - qt(0.975, n() - 1) * se,
    ci_upper = mean + qt(0.975, n() - 1) * se
  )

# Plot
ggplot(ci_data, aes(x = group, y = mean)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.2) +
  ggtitle("Means with 95% Confidence Intervals")
We will learn more about confidence intervals in the next lecture.
Thanks for listening! Questions?
This presentation is based on the SOLES Quarto reveal.js template and is licensed under a [Creative Commons Attribution 4.0 International License][cc-by]